
Support: surface hung tests on pytest session timeout#721

Merged
ChaoWao merged 1 commit into hw-native-sys:main from
ChaoWao:support/surface-hung-tests-on-pytest-session-timeout
May 8, 2026

Conversation


ChaoWao (Collaborator) commented May 8, 2026

Summary

When the parent pytest dispatcher's session watchdog fired
(--pto-session-timeout), the CI log showed only the timeout banner.
The actually-stuck child subprocess was killed silently, its captured
stdout discarded, and the child left to be reaped as an orphan by the
runner cleanup — see run 25541961688 for the symptom (9.5 min of dead
air, then [pytest] TIMEOUT, with no indication of which case was stuck).

This PR makes the timeout path self-diagnosing.

Changes

  • parallel_scheduler.py — emit [scheduler] START <label> pid=... devices=...
    at launch (not only at completion). The last START without a matching
    PASS/FAIL identifies the hung case at a glance.
  • parallel_scheduler.py — expose _active_state while run_jobs
    is in flight so the parent's signal handler can reach the live job
    table.
  • conftest.py — on session-timeout SIGALRM:
    1. send SIGUSR1 to every in-flight child (faulthandler, registered
      in pytest_configure, dumps all-thread Python + C stacks into the
      child's stdout — works even when the GIL is held by a native NPU
      call, which Python-level watchdogs cannot reach);
    2. let pumps drain ~2 s so the dumped stacks reach output_lines;
    3. print each in-flight job's tail buffer inside a ::group::HUNG ...
      block;
    4. call _terminate_all so children don't outlive us as orphans.

What the log will look like next time

Normal: every case has a [scheduler] START line followed (possibly out
of order, since cases run in parallel) by a ::group:: ... [PASS|FAIL ...]
block.

On hang: the START for the stuck case has no matching PASS/FAIL, then
[pytest] TIMEOUT: ... appears, followed by a ::group::HUNG <label> pid=...
elapsed=... block containing the all-thread traceback the child wrote when
it caught SIGUSR1. The existing CI retry (any non-zero rc → re-run with the
pinned PTO-ISA commit) is unchanged: a PTO-ISA regression that surfaces as
a hang is exactly the case that retry path is meant to recover from.
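The SIGUSR1 + faulthandler handoff can be demonstrated standalone. This script is illustrative (it is not code from the PR): a parent signals a "hung" child that registered faulthandler at startup, then recovers the child's stack dump over its stdout pipe:

```python
import signal
import subprocess
import sys
import time

# The child pretends to hang after registering faulthandler on SIGUSR1 —
# the same setup the PR performs in the child's pytest_configure.
CHILD = (
    "import faulthandler, signal, sys, time\n"
    "faulthandler.register(signal.SIGUSR1, file=sys.stdout, all_threads=True)\n"
    "print('child ready', flush=True)\n"
    "time.sleep(60)\n"  # stand-in for a stuck native call
)

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdout=subprocess.PIPE, text=True)
assert proc.stdout.readline().strip() == "child ready"
time.sleep(0.5)                    # let the child settle into its sleep
proc.send_signal(signal.SIGUSR1)   # parent side: trigger the stack dump
time.sleep(0.5)                    # let the dump land in the pipe
proc.terminate()                   # the _terminate_all step: no orphan left
out, _ = proc.communicate()
print(out)  # an all-thread traceback of the hung child
```

Because faulthandler's signal handler is implemented in C, the dump is written even while the child's Python threads are blocked, which is what lets this reach hangs a Python-level watchdog thread cannot.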

Notes

  • pytest-timeout per-test guard is intentionally not enabled; the
    single-layer "parent catches everything via SIGUSR1" path covers all
    hang shapes (including native NPU deadlocks where pytest-timeout's
    watchdog thread can't fire). If hangs become frequent enough that
    burning the full session budget per incident is painful, adding
    timeout = N / timeout_method = "thread" to pyproject.toml is a
    pure increment on top of this PR.
  • macOS has SIGUSR1; Windows does not. All new SIGUSR1 paths are
    guarded by hasattr(signal, "SIGUSR1").
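The child-side registration with that guard is small; pytest_configure is a real pytest hook, and the rest mirrors what the PR describes (placing it in the child's conftest.py and routing the dump to stdout are this sketch's assumptions):

```python
import faulthandler
import signal
import sys

def pytest_configure(config):
    # Guard: Windows has no SIGUSR1, so the hook is a no-op there.
    if hasattr(signal, "SIGUSR1"):
        # file=sys.stdout routes the dump into the captured stdout stream
        # the parent pumps; all_threads=True includes threads parked in
        # native calls that still hold the GIL.
        faulthandler.register(signal.SIGUSR1, file=sys.stdout,
                              all_threads=True, chain=True)
```

chain=True keeps any previously installed SIGUSR1 handler working, so the registration stays composable with other tooling.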

Testing

  • Local: trigger a hang test (time.sleep(9999)) under
    --pto-session-timeout 30; confirm the CI log shows
    ::group::HUNG ... with traceback, rc=124, no orphan python
    processes after exit.
  • CI: PR validation against simulation runners (no actual hang,
    should look identical to before plus the new START lines).
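The local repro in the first bullet can be as small as the following (the test name is illustrative; --pto-session-timeout is this repo's own option, not a stock pytest flag):

```python
import time

def test_deliberate_hang():
    # Sleeps far past the 30 s session budget, exercising the timeout path.
    time.sleep(9999)
```

Run it with something like `pytest --pto-session-timeout 30` and check for rc=124, a ::group::HUNG block with a traceback, and no leftover python processes afterwards.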

ChaoWao force-pushed the support/surface-hung-tests-on-pytest-session-timeout branch from 9525e7e to fccdced on May 8, 2026 09:02

gemini-code-assist (Bot) left a comment

Code Review

This pull request enhances the test session timeout handling by implementing a mechanism to debug hung child processes. It introduces a module-global state in the parallel scheduler to track active jobs, allowing the timeout handler to send SIGUSR1 signals to stuck processes for stack trace generation via faulthandler. The changes also include improved logging for job starts and GitHub Actions-style grouping for hung process output. I have no feedback to provide as there were no review comments.

ChaoWao merged commit 73b3fff into hw-native-sys:main on May 8, 2026
14 checks passed
ChaoWao deleted the support/surface-hung-tests-on-pytest-session-timeout branch on May 8, 2026 09:11
